Abstract

We  present  a  simple,  one-pass  word  alignment  algorithm  for  parallel  text.  Our  algorithm  utilizes  synchronous  parsing  and  takes  advantage 
of  existing  syntactic  annotations.  In  our  experiments  the  performance  of  this  model  is  comparable  to  more  complicated  iterative  methods. 
We  discuss  the  challenges  and  potential  benefits  of  using  this  model  to  train  syntactic  parsers  for  new  languages. 


1.  Introduction 

Word  alignment  is  a  common  exercise  given  to  students 
learning  a  foreign  language.  Given  a  pair  of  sentences  that 
are  translations  of  each  other,  the  students  are  asked  to  draw 
lines  between  words  that  mean  the  same  thing. 

In the context of multi-lingual natural language processing,
word alignment (more simply, alignment) is also a necessary
step for many applications. For instance, it is required in the
parameter estimation step for training statistical translation
models (Al-Onaizan et al., 1999; Brown et al., 1990; Melamed,
2000). Alignments are also useful for foreign language resource
acquisition. Yarowsky and Ngai (2001) use an alignment to
project part-of-speech (POS) tags from English to Chinese, and
use the resulting noisy corpus to train a reliable Chinese POS
tagger. Their result suggests that it is worthwhile to consider
more ambitious endeavors in resource acquisition.

Creating a syntactic treebank (e.g., the Penn Treebank Project
(Marcus et al., 1993)) is time-consuming and expensive. As a
consequence, state-of-the-art stochastic parsers, which rely on
such treebanks, are available only for languages in which a
treebank exists, such as English. If syntactic annotation could
be projected from English to a language for which no treebank
has been developed, then the treebank bottleneck may be
overcome (Cabezas et al., 2001).

In principle, the success of treebank acquisition in this
manner depends on a few key assumptions. The first assumption
is that syntactic relationships in one language can be directly
projected to another language using an accurate alignment. This
theory is explored in Hwa et al. (2002b). A second assumption
is that we have access to both an English parser and word
aligner that can perform their tasks at a sufficiently high
level of quality. Although high-quality English parsers are
available, high-quality aligners are more difficult to come by.
Most alignment research has out of necessity concentrated on
unsupervised methods. Even the best results are much worse than
alignments created by humans. Therefore, this paper focuses on
producing alignments that are tailored to the aims of syntactic
projection. In particular, we propose a novel alignment model
that, given an English sentence, its dependency parse tree, and
its translation, simultaneously generates alignments and a
dependency tree for the translation.


Our  alignment  model  aims  to  improve  alignment  accu¬ 
racy  while  maintaining  sensitivity  to  constraints  imposed 
by  the  syntactic  transfer  task.  We  hypothesize  that  the 
incorporation  of  syntactic  knowledge  into  the  alignment 
model  will  result  in  higher  quality  alignments.  Moreover, 
by  generating  alignments  and  parse  trees  simultaneously, 
the  alignment  algorithm  avoids  irreconcilable  errors  in  the 
projected  trees  such  as  crossing  dependencies.  Thus,  our 
two  objectives  complement  each  other. 

To verify these hypotheses, we have performed a suite of
experiments, evaluating our algorithm on the quality of the
resulting alignments and projected parse trees for English and
Chinese sentence pairs. Our initial experiments demonstrate
that our approach produces alignments whose quality is
comparable to those produced by current state-of-the-art
systems. Moreover, the output dependency trees are superior to
those produced by other methods.

We acknowledge that the strong assumptions we have stated for
the success of treebank acquisition do not always hold true
(Hwa et al., 2002a; Hwa et al., 2002b). Therefore, it will also
be necessary to devise a training algorithm that learns syntax
even in the face of substantial noise introduced by failures in
these assumptions. Although this last point is beyond the scope
of this paper, we will allude to potential syntactic transfer
approaches that are possible with our system, but infeasible
under other approaches.

2.  Background 

Synchronous  parsing  appears  to  be  the  best  model 
for  syntactic  projection.  Synchronous  parsing  models  the 
translation  process  as  dual  sentence  generation  in  which  a 
word  and  its  translation  in  the  other  sentence  are  generated 
in  lockstep.  Translation  pairs  of  both  words  and  phrases  are 
generated  in  a  manner  consistent  with  the  syntax  of  their 
respective  languages,  but  in  a  way  that  expresses  the  same 
relationship  to  the  rest  of  the  sentence.  Thus,  alignment 
and  syntax  are  produced  simultaneously  and  induce  mutual 
constraints  on  each  other.  This  model  is  ideal  for  the  pursuit 
of  our  objectives,  because  it  captures  our  complementary 
goals  in  an  elegant  theoretical  framework. 

Synchronous parsing requires both parses to adhere to the
constraints of a given monolingual parsing model. If we assume
context-free grammars, then each parse must be context-free. If
we assume dependency grammars, then each parse must observe the
planarity and connectivity constraints typical of such grammars
(e.g. Sleator and Temperley (1993)).

In contrast, many alignment models (Melamed, 2000; Brown et
al., 1990) rely on a bag-of-words model. This model presupposes
no structural constraints on either input sentence beyond its
linear order. To see why this type of model is problematic for
syntactic transfer, consider what happens when syntax
subsequently interacts with its output. Projecting dependencies
across such an alignment may result in a dependency tree that
violates planarity and connectivity constraints (Figure 1).


Figure 1: Violation of dependency grammar constraints caused by
projecting a dependency parse across a bag-of-words alignment.
Combining the syntax of Figure 1a with the alignment of Figure
1b produces the syntax of Figure 1c. In this example, the link
$(w_1, w_3)$ crosses the link $(w_2, w_5)$, violating the
planarity constraint, and the word $w_4$ is unconnected,
violating the connectivity constraint.

Once  the  fundamental  assumptions  of  the  syntactic 
model  have  been  breached,  there  is  no  clear  way  to  re¬ 
cover.  For  this  reason,  we  cannot  use  bag-of-words  align¬ 
ment  models,  although  in  many  respects  they  remain  state- 
of-the-art  for  alignment. 
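
The two constraints discussed above can be checked mechanically. The following is a minimal sketch, not part of the original system: it assumes a parse is stored as a list of 1-based head positions (0 for the root, and, as an assumption of this sketch, `None` for a word that received no head). The function names and the example parse are hypothetical.

```python
from itertools import combinations


def links(heads):
    """Return dependency links as (left, right) position pairs.

    heads[i - 1] is the 1-based position of the headword of word i,
    0 for the root, or None for a word without a head.
    """
    return [tuple(sorted((i, h)))
            for i, h in enumerate(heads, start=1)
            if h not in (0, None)]


def is_planar(heads):
    """True if no two links cross when drawn above the sentence."""
    for (a, b), (c, d) in combinations(links(heads), 2):
        if a < c < b < d or c < a < d < b:
            return False
    return True


def is_connected(heads):
    """True if there is exactly one root and every word reaches it."""
    if any(h is None for h in heads) or heads.count(0) != 1:
        return False
    for start in range(1, len(heads) + 1):
        seen = set()
        i = start
        while i != 0:
            if i in seen:  # a cycle never reaches the root
                return False
            seen.add(i)
            i = heads[i - 1]
    return True


# Hypothetical five-word projection: links (1,3) and (2,5) cross,
# and word 4 received no head at all.
projected = [3, 5, 0, None, 3]
# A well-formed three-word parse for comparison.
well_formed = [2, 0, 2]
```

On the hypothetical `projected` parse both checks fail, which is the situation a projection across an unconstrained alignment can produce.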

A canonical example of synchronous parsing is the Stochastic
Inversion Transduction Grammar (SITG) (Wu, 1995). The SITG
model imposes the constraints of context-free grammars on the
synchronous parsing environment. However, we regard
context-free grammars as problematic for our task, because
recent statistical parsing models (Charniak, 2000; Collins,
1999; Ratnaparkhi, 1999) owe much of their success to ideas
inherent to dependency parsing. We therefore adopt an algorithm
described in Alshawi and Douglas (2000).¹ Their algorithm
constructs synchronous dependency parses in the context of a
domain-specific speech-to-speech translation system. In their
system, synchronous parsing only enforces a contiguity
constraint on phrasal translations. The actual syntax of the
sentence is not assumed to be known. Nevertheless, their model
is a synchronous parser for dependency syntax, and we adopt it
for our purposes.

3.  Our  Modified  Alignment  Algorithm 

We  introduce  parse  trees  as  an  optional  input  to  the  al¬ 
gorithm  of  Alshawi  and  Douglas  (2000).  We  require  that 
output  dependency  trees  conform  to  dependency  trees  that 
are  provided  as  input.  If  no  parse  tree  is  provided,  our  al¬ 
gorithm  behaves  identically  to  that  of  Alshawi  and  Douglas 
(2000). 

3.1.  Definitions 

We assume as input a parallel corpus that has been segmented
into sentence pairs $(V = v_1 \ldots v_m,\ W = w_1 \ldots w_n)$.
The algorithm iterates over the sentence pairs producing
alignments.

We define a dependency parse as a rooted tree in which all
words of the sentence appear once, and each node in the tree is
such a word (Figure 2). An in-order traversal of the tree
produces the sentence. A word is said to be modified by any
words that appear as its children in the tree; conversely, the
parent of a word is known as its headword. A word is said to
dominate the span of all words that are descended from it in
the tree, and is likewise known as the headword of that span.²
Subject to these constraints, the dependency parse of $V$ is
expressed as a function $p_V : \{1 \ldots m\} \to \{0 \ldots m\}$
which defines the headword of each word in the dependency
graph. The expression $p_V(i) = 0$ indicates that word $v_i$ is
the root node of the graph (the headword of the sentence). The
dependency parse of $W$, $p_W : \{1 \ldots n\} \to \{0 \ldots n\}$,
is defined analogously.

An alignment is expressed as a function
$a : \{1 \ldots m\} \to \{0 \ldots n\}$ in which $a(i) = j$
indicates that word $v_i$ of $V$ is aligned with word $w_j$ of
$W$. The case in which $a(i) = 0$ denotes null-alignment (i.e.
the word $v_i$ does not correspond to any word in $W$). Under
the constraints of synchronous parsing, we require that if
$a(i) \neq 0$, then $p_W(a(i)) = a(p_V(i))$. In other words,
the headword of a word's translation is the translation of the
word's headword (Figure 3). We also require that the analogous
condition hold for the inverse alignment map
$a^{-1} : \{1 \ldots n\} \to \{0 \ldots m\}$.
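
As an illustration of the synchronous parsing constraint, the sketch below checks the condition for an alignment and its inverse. It is a hypothetical helper, not the authors' code, and it makes two assumptions that the text does not state: the alignment is one-to-one, and position 0 (the root) is treated as mapping to 0.

```python
def respects_constraint(a, p_src, p_tgt):
    """Check p_tgt(a(i)) == a(p_src(i)) for every aligned word i.

    a, p_src and p_tgt are lists indexed from word 1 (list index 0);
    the value 0 means null alignment in `a` and root in the parses.
    Assumption of this sketch: a(0) is taken to be 0.
    """
    def a_ext(i):
        return 0 if i == 0 else a[i - 1]

    return all(p_tgt[a_ext(i) - 1] == a_ext(p_src[i - 1])
               for i in range(1, len(a) + 1)
               if a_ext(i) != 0)


def inverse(a, n):
    """Invert a one-to-one alignment into a map from W positions to V."""
    inv = [0] * n
    for i, j in enumerate(a, start=1):
        if j != 0:
            inv[j - 1] = i
    return inv


def is_synchronous(a, p_v, p_w):
    """The constraint must hold for the alignment and for its inverse."""
    return (respects_constraint(a, p_v, p_w)
            and respects_constraint(inverse(a, len(p_w)), p_w, p_v))


# Toy example: v2 heads v1 and v3; w1 heads w2 and w3.
p_v = [2, 0, 2]
p_w = [0, 1, 1]
good = [2, 1, 3]   # v1-w2, v2-w1, v3-w3
bad = [1, 2, 3]    # aligns the head v2 with the modifier w2
```

In the toy example, `good` maps the headword $v_2$ to the headword $w_1$, so every dependency link is mirrored; `bad` does not.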

3.2.  Algorithm  Details 

Our algorithm (Appendix) is a bottom-up dynamic programming
procedure. It is initialized by considering all possible
alignments of one word to another word or to null.

¹An alternative to dependency grammar is the richer formalism
of Synchronized Tree-Adjoining Grammar (TAG) (Shieber and
Schabes, 1990). However, Synchronized TAG raises issues of
computational complexity and has not yet been exploited in a
stochastic setting.

²Elsewhere, the terms connectivity and planarity are used to
define these constraints.


Figure 2: A dependency parse. The top view depicts the sentence
in a tree form that makes the dominance and headword
relationships clear ($w_3$ is the headword of the sentence).
The bottom view depicts the same tree in more familiar sentence
form, with the links drawn above the words.


Figure 3: Synchronous dependency parses. Notice that all
dependency links are symmetric across the alignment. In
addition, the unaligned word $w_3$ is connected in the parse of
$W$.


Alshawi and Douglas (2000) considered alignments of two words
to one or no words, but we found in our evaluations that
restricting the initialization step to one word produced better
results. In fact, Melamed (2000) argues in favor of exclusively
one-to-one alignments. However, we may later explore in more
detail the effects of initializing from multi-word alignments.

As in Alshawi and Douglas (2000), each possible one-to-one
alignment is scored using the $\phi^2$ metric (Gale and Church,
1991), which is used to compute the correlation between
$v_i \in V$ and $w_j \in W$ over all sentence pairs $(V, W)$ in
the corpus. In Section 4.7 we consider the use of $\phi^2$ over
a different set of counts, so we will use $\phi^2_A$ to denote
its use over co-occurrence counts taken from the corpus.
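
For concreteness, the $\phi^2$ statistic of Gale and Church (1991) can be computed from a 2x2 contingency table of sentence-pair counts. The following is a minimal sketch rather than the implementation used in the experiments; the function name `phi_squared`, the toy corpus, and the choice to count a word at most once per sentence pair are assumptions made for illustration.

```python
def phi_squared(corpus, v, w):
    """Correlation of word v (first language) with word w (second language).

    corpus is a list of (v_sentence, w_sentence) pairs, each a list of
    tokens. The contingency counts are taken over sentence pairs:
      a: pairs containing both v and w
      b: pairs containing v but not w
      c: pairs containing w but not v
      d: pairs containing neither
    """
    a = b = c = d = 0
    for v_sent, w_sent in corpus:
        has_v, has_w = v in v_sent, w in w_sent
        if has_v and has_w:
            a += 1
        elif has_v:
            b += 1
        elif has_w:
            c += 1
        else:
            d += 1
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # undefined when a row or column of the table is empty
    return (a * d - b * c) ** 2 / denom


# Hypothetical four-pair corpus, for illustration only.
toy_corpus = [
    (["the", "cat"], ["le", "chat"]),
    (["the", "dog"], ["le", "chien"]),
    (["a", "cat"], ["un", "chat"]),
    (["a", "dog"], ["un", "chien"]),
]
```

In the toy corpus, "cat" and "chat" always co-occur and receive the maximum score of 1, while "cat" and "le" co-occur no more often than chance and receive 0.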

To compute alignments of larger spans, the algorithm combines
adjacent subalignments. During this step, one subalignment
becomes a modifier phrase. Interpreting this in terms of
dependency parsing, the aligned headwords of the modifier
phrase become modifiers of the aligned headwords of the other
phrase. At each step, the cost of the alignment is computed.
Following Alshawi and Douglas (2000) we simply add the cost of
the subalignments. Thus the overall cost of any aligned
subphrase can be computed as follows.

$$\sum_{(i,j)\,:\,a(i)=j} \phi^2_A(v_i, w_j)$$

The  output  of  the  algorithm  is  simply  the  highest- 
scoring  alignment  that  covers  the  entire  span  of  both  V  and 
W. 

3.3.  Treatment  of  Null  Alignments 

Null alignments present a few practical issues. For experiments
involving $\phi^2_A$, we adopt the practice of counting a null
token in the shorter sentence of each pair.³ An alternative
solution to this problem would involve initialization from a
word association model that explicitly handles nulls, such as
that of Melamed (2000).

An implication of the synchronous parsing constraint given in
Section 3.1 is that null-aligned words must be leaf words
within their monolingual dependency graphs. In certain cases
this may not lead to the best synchronized parse. We remove
this condition. Effectively, we consider each sentence to
consist of the same number of tokens, some of which may be null
tokens (usually, this will introduce null tokens into only the
shorter sentence, but not necessarily). The null tokens behave
like word tokens with regard to the synchronous parsing
constraint, but they do not impact phrase contiguity.⁴ Only in
the resulting surface dependency graphs do we remove null
tokens, by contracting all edges between the null token and its
parent and naming the resultant node with the word on the
parent node. Recall from graph theory that contraction is an
operation whereby an edge is removed and the nodes at its
endpoints are conflated.⁵ Thus, word tokens that modify a null
token are interpreted as modifiers of the null token's
headword. This is illustrated in Figure 4. One important
implication of this is that we can only allow a null token to
be the headword of the sentence if it has a single modifier.
Otherwise, the result of the graph contraction would not be a
rooted tree. We found that this treatment of null alignments
resulted in a slight improvement in alignment results.
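
The contraction step can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: null tokens are marked with `None` (an assumption of this sketch), each remaining word is re-attached to its nearest non-null ancestor, and a null headword with more than one modifier is rejected because the result would have several roots.

```python
def contract_null_tokens(tokens, heads):
    """Remove null tokens from a dependency parse by edge contraction.

    tokens: list of words, with None standing for a null token.
    heads:  1-based head position of each token, 0 for the root.
    Returns (words, new_heads) over the surface words only.
    The input is assumed to be a well-formed tree.
    """
    def real_head(i):
        # Climb past null tokens until a word or the root is reached.
        h = heads[i - 1]
        while h != 0 and tokens[h - 1] is None:
            h = heads[h - 1]
        return h

    # Map old positions of real words to their new positions.
    new_pos = {}
    for i, tok in enumerate(tokens, start=1):
        if tok is not None:
            new_pos[i] = len(new_pos) + 1

    words = [tok for tok in tokens if tok is not None]
    new_heads = []
    for i in new_pos:
        h = real_head(i)
        new_heads.append(0 if h == 0 else new_pos[h])

    if new_heads.count(0) != 1:
        raise ValueError("contraction does not yield a rooted tree")
    return words, new_heads


# "a" modifies a null token, which in turn modifies "b".
example_tokens = ["a", None, "b", "c"]
example_heads = [2, 3, 0, 3]
```

In the example, contracting the edge between the null token and "b" makes "a" a direct modifier of "b".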

³Srinivas Bangalore, personal communication.

⁴A null token is considered to be contiguous with any other
subphrase; another way to view this is that a null token is an
unseen word that may appear at any location in the sentence in
order to satisfy contiguity constraints.

⁵See e.g. Gross and Yellen (1999).


4.  Evaluation 


Figure 4: Effect of null words on synchronous parses. In this
case, word $w_3$ has been null-aligned to the null token $v_0$.
However, $v_0$ can still participate in the synchronous parse
produced by the algorithm. Once the structure has been
completed, the edge between $v_0$ and $v_3$ (indicated by the
dashed line) will contract. This will result in the inferred
dependency (indicated by the dotted line) between $v_1$ and
$v_3$.


3.4.  Analysis 

In  the  case  that  there  are  no  parses  available,  the  compu¬ 
tational  complexity  of  the  algorithm  is  0(m®n®),  but  with 
a  parse  of  V  (and  an  efficient  enumeration  of  the  subphrase 
combinations  allowed  by  the  parse)  the  complexity  reduces 
to  0{m^n).  If  both  parses  are  available  the  complexity 
would  be  reduced  to  0{mn). 

It is important to note that as it is presented, our algorithm
does not search the entire space of possible alignment/tree
combinations. Melamed observes that two modifications are
required to accomplish this.⁶ The first modification entails
the addition of four new loop parameters
to  enumerate  the  possible  headwords  of  the  four  monolin¬ 
gual  subspans.  These  additional  parameters  add  a  factor  of 
O(m^n^).  Second,  Melamed  points  out  that  for  a  small 
subset  of  legal  structures,  it  must  be  possible  to  combine 
subphrases  that  are  not  adjacent  to  one  another.  The  most 
efficient  solution  to  this  problem  adds  two  more  parameters, 
for  a  total  of  0(m®n®).  The  best  known  optimization  re¬ 
duces  this  to  0(m®n®).  This  is  far  too  complex  for  a  prac¬ 
tical  implementation.  As  such,  we  chose  to  use  the  origi¬ 
nal  algorithm  for  our  evaluations.  Thus  we  rec¬ 

ognize  that  our  algorithm  does  not  search  the  entire  space 
of  synchronous  parses.  It  inherently  incorporates  a  greedy 
heuristic,  since  for  each  subphrase,  it  considers  only  the 
most  likely  headword. 


⁶I. Dan Melamed, personal communication.


We have performed a suite of experiments to evaluate our
alignment algorithm. The qualities of the resulting alignments
and dependency parse trees are quantified by comparisons with
correct human-annotated parses. We compare the alignment output
of our algorithm with that of the basic algorithm described in
Alshawi and Douglas (2000) and the well-known IBM statistical
model described in Brown et al. (1990) using the freely
available implementation (Giza++) described in Al-Onaizan et
al. (1999). We found that our model, which combines the
$\phi^2$ statistic with syntactic annotation, performs
alignments at a level comparable to the complex iterative IBM
statistical model, and produces better dependency trees than
any other method. We compare these trees against several
baselines and against projected dependency trees created in the
manner described in Hwa et al. (2002a).

4.1.  Data  Set 

The  language  pair  we  have  focused  on  for  this  study  is 
English-Chinese.  The  training  corpus  consists  of  around 
56,000  sentence  pairs  from  the  Hong  Kong  News  parallel 
corpus.  Because  the  training  corpus  is  solely  used  for  word 
co-occurrence  statistics,  no  annotation  is  performed  on  it. 

The development set was constructed by obtaining manual English
translations for 47 Chinese sentences of 25 words or less,
taken from sections 001-015 of the Chinese Treebank (Xia et
al., 2000). A separate test set, consisting of 46 Chinese
sentences of 25 words or less, was constructed in a similar
fashion.⁷ To obtain correct English parses, we used a
context-free parser (Collins, 1999) and converted its output to
dependency format. To obtain correct Chinese parses, Chinese
Treebank trees were converted to dependency format. Both sets
of parses were hand-corrected. The correct alignments for the
development and test set were created by two native Chinese
speakers using annotation software similar to that described in
Melamed (1998).

4.2.  Metrics  for  evaluating  alignments 

As a measure of alignment accuracy, we report Alignment
Precision (AP) and Alignment Recall (AR) figures. These are
computed by comparing the alignment links made by the system
with the links in the correct alignment. We denote the set of
guessed alignment links by $G_a$ and the set of correct
alignment links by $C_a$. Precision is given by
$AP = |C_a \cap G_a| / |G_a|$. Recall is given by
$AR = |C_a \cap G_a| / |C_a|$.

We also compute the F-score (AF), which is given by
$AF = 2 \cdot AP \cdot AR / (AP + AR)$. Null alignments are
ignored in all computations. Our evaluation metric is similar
to that used by Och and Ney (2000).
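
The alignment metrics can be sketched as below, assuming the usual set-based definitions: precision is the fraction of guessed links that are correct, recall is the fraction of correct links that were guessed, and the F-score is their harmonic mean. Representing links as `(i, j)` pairs with 0 for null is an assumption of this sketch, as are the function names.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def alignment_scores(guessed, correct):
    """Return (AP, AR, AF) for two collections of (i, j) alignment links.

    Links with i == 0 or j == 0 are null alignments and are ignored.
    """
    g = {(i, j) for i, j in guessed if i != 0 and j != 0}
    c = {(i, j) for i, j in correct if i != 0 and j != 0}
    hits = len(g & c)
    ap = hits / len(g) if g else 0.0
    ar = hits / len(c) if c else 0.0
    return ap, ar, f_score(ap, ar)


# Hypothetical three-word sentence pair.
guessed_links = [(1, 1), (2, 3), (3, 0)]
correct_links = [(1, 1), (2, 2), (3, 3)]
ap, ar, af = alignment_scores(guessed_links, correct_links)
```

Because null alignments are dropped before counting, the guessed set in the example contains two links, one of which is correct.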

4.3.  Metrics  for  evaluating  projected  parse  trees 

As a measure of induced dependency tree accuracy, we report
unlabelled Chinese Tree Precision (CTP). This is computed by
comparing the output dependency tree with the correct
dependency trees. We denote the set of guessed dependency links
by $G_p$ and the set of correct dependency links by $C_p$. A
small number of words (mostly punctuation) were not linked to
any parent word in the correct parse; links containing these
words are not included in either $C_p$ or $G_p$. Precision is
given by $CTP = |C_p \cap G_p| / |G_p|$. For dependency trees,
$|C_p| = |G_p|$, since each word contributes one link relating
it to its headword. Thus, recall is the same as precision for
our purposes.

⁷These sentences have already been manually translated into
English as part of the NIST MT evaluation preview (See
http://www.nist.gov/speech/tests/mt/). The sentences were taken
from sections 038, 039, 067, 122, 191, 207, 249.

Synchronous Parsing Method                                   AP    AR    AF    CTP
sim-Alshawi ($\phi^2_A$)                                     40.6  36.5  38.4  18.5
sim-Alshawi ($\phi^2_A$) + English parse                     43.8  39.3  41.4  39.9
sim-Alshawi ($\phi^2_A$) + English parse + Chinese bigrams   42.9  38.5  40.6  39.4
sim-Alshawi ($\phi^2_A$) + both bigrams                      41.5  37.3  39.3  16.5
Giza++ initialization ($\phi^2_G$)                           51.2  45.9  48.4  11.6
Giza++ initialization ($\phi^2_G$) + English parse           49.6  44.6  47.0  44.7

Baseline Method                                              AP    AR    AF    CTP
Same Order Alignment                                         15.7  14.1  14.8  NA
Random Alignment (avg scores)                                7.8   7.0   7.4   NA
Forward-chain                                                NA    NA    NA    37.3
Backward-chain                                               NA    NA    NA    12.9
Giza++                                                       68.7  40.9  51.3  NA
Hwa et al. (2002a)                                           NA    NA    NA    44.1

Table 1: Alignment Results for All Methods.
AP = Alignment Precision. AR = Alignment Recall. AF = Alignment
F-Score. CTP = Chinese Tree Precision.
All scores are reported as percentages.
The best scores in each table appear in bold.

4.4.  Baseline  Results 

We  first  present  the  scores  of  some  naive  algorithms  as 
a  baseline  in  order  to  provide  a  lower  bound  for  our  re¬ 
sults.  The  results  of  the  baseline  experiments  are  included 
with  all  other  results  in  Table  1.  Our  first  baseline  (Same 
Order  Alignment)  simply  maps  character  Vi  in  the  English 
sentence  to  character  Wi  in  the  Chinese  sentence,  or  in 
the case of i > n. Our second baseline (Random Alignment)
randomly aligns word $v_i$ to word $w_j$, subject to the
constraint that no words are multiply aligned. We report the
average scores over 100 runs of this baseline. The best Random
Alignment F-score was 10.0% and the worst was 5.3%, with a
standard deviation of 0.9%.

For  parse  trees,  we  use  two  simple  baselines.  In  the 
first  (Forward-Chain),  each  word  modifies  the  word  imme¬ 
diately  following  it,  and  the  last  word  is  the  headword  of  the 
sentence.  For  the  second  baseline  (Backward-Chain),  each 
word  modifies  the  word  immediately  preceding  it,  and  the 
first  word  is  the  headword  of  the  sentence.  No  alignment 
was  performed  for  these  baselines. 
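
The two parse-tree baselines and the tree precision measure of Section 4.3 can be sketched as follows. This is an illustrative sketch: head lists use 0 for the headword of the sentence, and, as an assumption of the sketch, `None` marks a word that has no parent in the correct parse. The function names and the toy parse are hypothetical.

```python
def forward_chain(n):
    """Each word modifies the next word; the last word is the headword."""
    return [] if n == 0 else list(range(2, n + 1)) + [0]


def backward_chain(n):
    """Each word modifies the previous word; the first is the headword."""
    return list(range(0, n))


def tree_precision(guessed, correct):
    """Fraction of guessed dependency links found in the correct parse.

    Words whose correct head is None are excluded, together with any
    link that contains them as modifier or as head.
    """
    skip = {i for i, h in enumerate(correct, start=1) if h is None}

    def kept(heads):
        return {(i, h) for i, h in enumerate(heads, start=1)
                if i not in skip and h not in skip}

    g, c = kept(guessed), kept(correct)
    return len(g & c) / len(g) if g else 0.0


# Hypothetical four-word parse; word 4 (e.g. punctuation) is unlinked.
toy_correct = [2, 0, 2, None]
```

On the toy parse, the forward chain gets the link from word 1 to word 2 right and the backward chain gets the link from word 3 to word 2 right.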

The final baselines relate to the Giza++ algorithm. This
produces the best result for alignment. For reasons described
previously, this cannot be directly used for projection.
However, Hwa et al. (2002a) contains an investigation in which
trees output from Giza++ are modified using several heuristics,
and subsequently improved using linguistic knowledge of
Chinese. We report the Chinese Tree Precision obtained by this
method.

4.5.  Synchronous  Parsing  Results 

Our first set of alignments combines the cross-lingual
co-occurrence metric $\phi^2_A$ described previously with
either English parse trees or no parse trees. In this set,
$\phi^2_A$ with no parse is nearly identical to the approach
described in Alshawi and Douglas (2000) (excepting our
treatment of null alignments). Thus, it serves as a useful
point of comparison for runs that make use of other
information. In Table 1 we refer to it as sim-Alshawi.

What we find is that incorporating parse trees results in a
modest improvement over the baseline approach of Alshawi and
Douglas (2000). We notice that using a single parse provides a
very slight improvement in alignment, but a noticeable
improvement in induced parse trees.

Why aren't the improvements more substantial? One observation
is that using parses in this manner results in only passive
interaction with the cross-lingual $\phi^2_A$ scores. In other
words, the parse filters out certain alignments, but cannot in
any other way counteract the biases inherent in the word
statistics. Nevertheless, it appears to be modest progress.

4.6.  Results  of  Using  Bigrams  to  Approximate  Parses 

The  results  suggest  that  using  parses  to  constrain  the 
alignment  is  helpful.  It  is  possible  that  using  both  parses 
would  result  in  a  more  substantial  improvement.  However, 
we  have  already  stated  that  we  are  interested  in  the  case  of 
asynchronous  resources.  Under  this  scenario,  we  only  have 
access  to  one  parse.  Is  there  some  way  that  we  can  approxi¬ 
mate  syntactic  constraints  of  a  sentence  without  having  ac¬ 
cess  to  its  parse? 

The parsers of Charniak (2000), Collins (1999), and
Ratnaparkhi (1999) make substantial use of bilexical
dependencies. Bilexical dependencies capture the idea that
linked words in a dependency parse have a statistical affinity
for each other: they often appear together in certain contexts.
We suspect that bigram statistics could be used as a proxy for
actual bilexical dependencies.

We constructed a simple test of this theory: for each English
sentence $V = v_1 \ldots v_m$ in the development set with parse
$p_V : \{1 \ldots m\} \to \{0 \ldots m\}$, we first construct
the set of all bigrams
$B = \{(v_i, v_j) : 1 \le i < j \le m\}$. We then partition $B$
into two sets: bigrams of linked words, i.e.
$L = \{(v_i, v_j) : (v_i, v_j) \in B;\ p_V(i) = j \text{ or } p_V(j) = i\}$,
and unlinked words $U = B - L$. Using the Bigram Statistics
Package described in Pedersen (2001), we collected bigram
statistics over the entire dev/train corpus. We then computed
the average statistical correlation of each set using a variety
of metrics (log-likelihood, Dice, $\phi^2$). The results
indicated that bigrams in the linked set $L$ were more
correlated than those in the unlinked set $U$ under all
metrics. We repeated this experiment with the development
sentences in Chinese, with similar results. Although this is by
no means a conclusive experiment, we took the results as an
indication that using bigram statistics as an approximation of
a parse might be helpful where no parse was actually available.
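
The partition used in this test can be sketched as follows. The sketch is illustrative only: it does not use the Bigram Statistics Package, the `score` callable passed to `mean_score` is a hypothetical stand-in for whichever correlation metric is being examined, and the example sentence and its parse are invented.

```python
def partition_bigrams(words, heads):
    """Split all ordered word pairs of a sentence into linked and unlinked.

    words: tokens of the sentence.
    heads: 1-based head position of each word, 0 for the headword.
    A pair (i, j) with i < j is linked if one word is the head of the other.
    """
    linked, unlinked = [], []
    m = len(words)
    for i in range(1, m + 1):
        for j in range(i + 1, m + 1):
            pair = (words[i - 1], words[j - 1])
            if heads[i - 1] == j or heads[j - 1] == i:
                linked.append(pair)
            else:
                unlinked.append(pair)
    return linked, unlinked


def mean_score(pairs, score):
    """Average a correlation score over a list of word pairs."""
    return sum(score(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0


# Invented example: "the" modifies "dog", "dog" modifies "barks".
sent = ["the", "dog", "barks"]
sent_heads = [2, 3, 0]
linked_pairs, unlinked_pairs = partition_bigrams(sent, sent_heads)
```

Comparing `mean_score(linked_pairs, score)` with `mean_score(unlinked_pairs, score)` over a whole development set corresponds to the comparison described above.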

To incorporate bigram statistics into our alignment model, we
modified the scoring function in the following manner: each
time a dependency link is introduced between words and we do
not have access to the source parse, we add into the alignment
score the bigram score of the two words. The bigram score is
based on the $\phi^2$ metric computed for bigram correlation.
We call this $\phi^2_B$. The resulting alignment score can now
be given by the following formula.

$$\sum_{(i,j)\,:\,a(i)=j} \phi^2_A(v_i, w_j) \;+\; \sum_{(i,j)\,:\,p_W(i)=j} \phi^2_B(w_i, w_j)$$
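
A minimal sketch of this modified score is given below. It follows the prose description (cross-lingual scores for aligned word pairs plus bigram scores for the dependency links introduced in the unparsed sentence) rather than any released code. The scorers `phi_a` and `phi_b` are hypothetical callables, and skipping null alignments and the root link is a simplifying assumption of the sketch.

```python
def alignment_score(a, p_w, v, w, phi_a, phi_b):
    """Score an alignment together with the induced parse of W.

    a:     alignment, a[i - 1] is the W position aligned to word i of V
           (0 for null alignment).
    p_w:   induced parse of W, 1-based head positions, 0 for the headword.
    v, w:  token lists of the two sentences.
    phi_a: cross-lingual scorer, phi_a(v_word, w_word) -> float.
    phi_b: monolingual bigram scorer, phi_b(w_word, w_head) -> float.
    """
    cross = sum(phi_a(v[i - 1], w[j - 1])
                for i, j in enumerate(a, start=1) if j != 0)
    mono = sum(phi_b(w[i - 1], w[h - 1])
               for i, h in enumerate(p_w, start=1) if h != 0)
    return cross + mono


# Toy scorers backed by small hypothetical tables.
cross_table = {("v1", "w2"): 0.5, ("v2", "w1"): 0.25}
bigram_table = {("w2", "w1"): 0.125}
toy_score = alignment_score(
    a=[2, 1],
    p_w=[0, 1],
    v=["v1", "v2"],
    w=["w1", "w2"],
    phi_a=lambda x, y: cross_table.get((x, y), 0.0),
    phi_b=lambda x, y: bigram_table.get((x, y), 0.0),
)
```

Because the bigram term is added to the same total as the cross-lingual term, a candidate dependency link with a strong bigram score can outweigh a slightly better cross-lingual pairing.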

Our results indicate that using Chinese bigram statistics in
conjunction with English parse trees in this manner results in
a small decrease in the score along all measures. Nonetheless,
there are two intuitively appealing interpretations of using
bigrams in this way. The first is that the modification of the
scoring function provides competitive interaction between parse
information and cross-lingual statistics. The second is that if
bigram statistics represent a weak approximation of syntax,
then perhaps the iterative refinement of this statistic (e.g.
by taking counts only over words that were linked in a previous
iteration) would satisfy our objective of syntactic transfer.
It is not clear from the results that this is the case.
However, it does provide a starting point for syntactic
statistics that is not available if we use only cross-lingual
statistics.

4.7.  Results  of  Using  Better  Word  Statistics 

Our results show that using parse information and coarse
cross-lingual word statistics provides a modest boost over an
approach using only the cross-lingual word statistics. We also
decided to investigate what happens when we seed our algorithm
with better cross-lingual statistics.

To test this, we initialize our co-occurrence counts from
alignment links output by the Giza++ alignment of our corpus.
We still use $\phi^2$ to compute the correlation. We call this
$\phi^2_G$. Predictably, using the better word correlation
statistics improves the quality of the alignment output in all
cases. In this scenario, adding parse information does not seem
to improve the alignment score. However, parse trees induced in
this manner achieve a higher precision than any of the other
methods. This configuration outscores the baseline algorithms
by a significant amount, and produces results comparable to the
baseline of Hwa et al. (2002a). It is important to note,
however, that the baseline of Hwa et al. (2002a) is achieved
only after the application of linguistic rules to the output of
the Giza++ alignment. Additionally, the trees themselves may
contain errors of the type described in Section 2. Our tree
precision results directly from the application of our
synchronous parsing algorithm, and all of the output trees are
valid dependency parses.

5.  Future  Work 

We believe that a fundamental advantage of our baseline
model is its simplicity. Improving upon it will be considerably
easier than improving upon a complex model such
as the one described in Brown et al. (1990). Improvements
may proceed along several possible paths. One path
would involve reformulating the scoring functions in terms
of statistical models (e.g., generative models). A natural
complement to this path would be the introduction of iteration
with the goal of improving the alignments and the
accompanying models. In this approach, we could attempt
to learn a coarse statistical model of the syntax of the
low-density language after each iteration of the alignment. This
information could in turn be used as evidence in the next
iteration of the alignment model, hopefully improving its
performance. Our results have already established a set of
statistics that could be used in the initial iteration of such
a task. The iterative approach resonates with an idea proposed
in Yarowsky and Ngai (2001), regarding the use of
learned part-of-speech taggers in subsequent alignment
iterations.

An orthogonal approach would be the application of additional
linguistic information. Our results indicated that
syntactic knowledge can help improve alignment. Additional
linguistic knowledge obtained from named-entity
analyses, phrasal boundary detection, and part-of-speech
tags might also improve alignment.

Although our output dependency trees represent definite
progress, trees with such low precision cannot be
used directly to train statistical parsers that assume correct
training data (Charniak, 2000; Collins, 1999; Ratnaparkhi,
1999). There are two possible methods of improving upon
the precision of this training data. The first is the use of
noise-resistant training algorithms such as those described
in Yarowsky and Ngai (2001). The second is the possibility
of improving the precision yield by removing obviously
bad training examples from the set. Unlike the baseline
model, our word alignment model provides an obvious
means of doing this. One possibility is to use a score
gleaned from the alignment algorithm as a means of ranking
dependency links, and removing links whose score is
above some threshold. We hope that a dual approach of
improving the precision of the training examples, while
simultaneously reducing the sensitivity of the training
algorithm, will result in the ability to train a reasonably accurate
statistical parser for the new language.
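
The link-filtering idea above can be sketched as follows. This is only an illustration: the triple layout (head index, modifier index, score), the function name, and the threshold value are hypothetical, and the score is assumed to behave like a cost, so that lower values indicate more reliable links.

```python
def filter_links(scored_links, threshold):
    """Rank dependency links by score and drop those above the threshold.

    scored_links: iterable of (head, modifier, score) triples, where the
    score is assumed to be a cost taken from the alignment chart.
    """
    ranked = sorted(scored_links, key=lambda link: link[2])
    return [(head, mod) for head, mod, score in ranked if score <= threshold]


toy_links = [(0, 1, -0.9), (2, 1, 0.4), (3, 2, -0.1)]
kept = filter_links(toy_links, threshold=0.0)
```

A suitable threshold would have to be chosen empirically, for example on a small set of hand-annotated sentences.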

6.  Related  Work

Al-Onaizan et al. (1999), Brown et al. (1990)
and Melamed (2000) focus on the description of statistical
translation models based on the bag-of-words model.
Alignment plays a crucial part in the parameter estimation
methods of these models, but they remain inadequate
for syntactic transfer for reasons described in Section 2.
The work of Hwa et al. (2002b) includes an investigation
into the combination of syntax with the output of this type
of model. Och et al. (1999) present a statistical translation
model that performs phrasal translation, but it relies
on shallow phrases that are discovered statistically, and
makes no use of syntax. Yamada and Knight (2001) create
a full-fledged syntax-based translation model. However,
their model is unidirectional; it only describes the syntax
of one sentence, and makes no provision for the syntax of
the other. Wu (1995) presents a complete theory of synchronous
parsing using a variant of context-free grammars,
and exhibits several positive results, though not for syntax
transfer. Alshawi and Douglas (2000) present the synchronous
parsing algorithm on which our work is based.
Much like the work on translation models, however, this
work is interested in alignment primarily as a mechanism
for training a machine translation system. Variations on
the synchronous parsing algorithm appear in Alshawi et al.
(2000a) and Alshawi et al. (2000b), but the algorithm of
Alshawi and Douglas (2000) appears to be the most flexible.

7.  Conclusion 

We have described a new approach to alignment that
incorporates dependency parses into a synchronous parsing
model. Our results indicate that this approach yields
alignments whose quality is comparable to that of alignments
produced by more complicated iterative techniques. In
addition, our approach demonstrates substantial promise in
the task of learning syntactic models for resource-poor languages.

A  Algorithm  Pseudocode


The  following  code  is  as  general  as  possible  about  what  constitutes  a  legal  combination  of  subspans  for  an 
alignment.  This  is  because  legal  subspans  may  depend  on  input  constraints  (such  as  a  parse).  Implicit  in 
the  code  is  the  idea  that  the  legal  combinations  should  be  enumerated  in  a  reasonable  way.  That  is,  small 
spans  should  be  enumerated  before  larger  spans  that  may  be  constructed  from  them.  In  the  original  algorithm 
described  in  Alshawi  and  Douglas  (2000),  all  possible  combinations  of  subspans  across  both  languages  are 
legal. 

The variables $i_V$ and $j_V$ denote the span $v_{i_V+1} \ldots v_{j_V}$, and $p_V$ denotes a partition of the span such that
$i_V < p_V \le j_V$. The variables $i_W$, $j_W$, and $p_W$ are defined analogously on $W$.

Finally, we assume that we have a chart $\alpha$, which contains cells indexed by $i_V$, $j_V$, $i_W$, and $j_W$. Each cell
contains subfields headPhrase, modifierPhrase, and cost.

for all legal combinations of i_V, j_V, i_W, and j_W
    α(i_V, j_V, i_W, j_W) = φ²(v_{i_V+1}...v_{j_V}, w_{i_W+1}...w_{j_W})
for all legal combinations of i_V, j_V, p_V, i_W, j_W, and p_W
    consider the case in which aligned subphrases are in the same order in both languages
    headPhrase = α(i_V, p_V, i_W, p_W)
    modifierPhrase = α(p_V, j_V, p_W, j_W)
    cost = cost(headPhrase, modifierPhrase)
    if cost < α(i_V, j_V, i_W, j_W).cost then
        α(i_V, j_V, i_W, j_W) = new subAlignment(headPhrase, modifierPhrase, cost)
    swap(headPhrase, modifierPhrase)
    cost = cost(headPhrase, modifierPhrase)
    if cost < α(i_V, j_V, i_W, j_W).cost then
        α(i_V, j_V, i_W, j_W) = new subAlignment(headPhrase, modifierPhrase, cost)
    consider the case in which aligned subphrases are in the reverse order in each language
    headPhrase = α(i_V, p_V, p_W, j_W)
    modifierPhrase = α(p_V, j_V, i_W, p_W)
    cost = cost(headPhrase, modifierPhrase)
    if cost < α(i_V, j_V, i_W, j_W).cost then
        α(i_V, j_V, i_W, j_W) = new subAlignment(headPhrase, modifierPhrase, cost)
    swap(headPhrase, modifierPhrase)
    cost = cost(headPhrase, modifierPhrase)
    if cost < α(i_V, j_V, i_W, j_W).cost then
        α(i_V, j_V, i_W, j_W) = new subAlignment(headPhrase, modifierPhrase, cost)
return α(0, m, 0, n)
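
For readers who prefer executable code, the chart-filling loop above can be approximated by the following simplified sketch. Several parts are assumptions made only for illustration: every combination of subspans is treated as legal, partitions are strict (both halves are non-empty in both languages), the leaf cost and `skip_penalty` are hypothetical, and the combination cost is a plain sum. Because a sum is symmetric, the head/modifier swap does not change the cost and is omitted here.

```python
from itertools import product


def align(v, w, score, skip_penalty=1.0):
    """Fill the chart bottom-up; return (cost, links) for the full spans.

    v, w: lists of words. score(v_word, w_word): word-pair correlation
    (higher is better). Links are (index in v, index in w) pairs.
    """
    m, n = len(v), len(w)
    chart = {}  # (i_V, j_V, i_W, j_W) -> (cost, links)

    def leaf(iV, jV, iW, jW):
        # Hypothetical initial cost: reward a one-to-one word pair;
        # otherwise charge a penalty for each word left unaligned.
        if jV - iV == 1 and jW - iW == 1:
            return (-score(v[iV], w[iW]), [(iV, iW)])
        return (skip_penalty * ((jV - iV) + (jW - iW)), [])

    # Small spans are enumerated before the larger spans built from them.
    for lenV, lenW in product(range(1, m + 1), range(1, n + 1)):
        for iV, iW in product(range(m - lenV + 1), range(n - lenW + 1)):
            jV, jW = iV + lenV, iW + lenW
            best = leaf(iV, jV, iW, jW)
            for pV, pW in product(range(iV + 1, jV), range(iW + 1, jW)):
                candidates = (
                    # subphrases in the same order in both languages
                    ((iV, pV, iW, pW), (pV, jV, pW, jW)),
                    # subphrases in the reverse order in each language
                    ((iV, pV, pW, jW), (pV, jV, iW, pW)),
                )
                for first, second in candidates:
                    cost = chart[first][0] + chart[second][0]
                    if cost < best[0]:
                        best = (cost, chart[first][1] + chart[second][1])
            chart[iV, jV, iW, jW] = best
    return chart[0, m, 0, n]


def toy_score(a, b):
    # Hypothetical word-pair scores for a two-word example.
    return 1.0 if (a, b) in {("red", "rouge"), ("car", "voiture")} else 0.0


cost, links = align(["red", "car"], ["voiture", "rouge"], toy_score)
# cost is -2.0 and links is [(0, 1), (1, 0)]: the reverse-order case wins.
```

Constraining the enumeration with a parse would amount to skipping the span combinations that the parse rules out, which this sketch does not attempt.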


